Abstract

We present an extension of sparse coding to the problems of multitask and transfer learning. 
The central assumption of the method is that the task parameters are well approximated
by sparse linear combinations of the atoms of a dictionary on a high or infinite dimensional
space. This assumption, together with the large quantity of available data in the
multitask and transfer learning settings, allows a principled choice of the dictionary. We 
provide bounds on the generalization error of this approach, for both settings. Preliminary 
experiments indicate the advantage of the sparse multitask coding method over single task 
learning and a previous method based on orthogonal and dense representation of the tasks. 



1 Introduction 



The last decade has witnessed many efforts of the machine learning community to exploit
assumptions of sparsity in the design of algorithms. A central development in this respect is the
Lasso [29], which estimates a linear predictor in a high dimensional space under a regularizing
$\ell_1$-penalty. Theoretical results guarantee a good performance of this method under the
assumption that the vector corresponding to the underlying predictor is sparse, or at least has a
very small $\ell_1$-norm, see for example [11, 12, 31] and references therein.

In this work, we consider the case where the predictors are linear combinations of the atoms 
of a dictionary of linear functions on a high or infinite dimensional space, and we assume 
that we are free to choose the dictionary. We will show that a principled choice is possible, 
if there are many learning problems, or "tasks", and there exists a dictionary allowing sparse, 
or nearly sparse representations of all or most of the underlying predictors. In such a case 
we can then exploit the larger quantity of available data to estimate the "good" dictionary and 
still reap the benefits of the Lasso for the individual tasks. This paper gives theoretical and 
experimental justification of this claim, both in the domain of multitask learning, where the new 
representation is applied to the tasks from which it was generated, and in the domain of learning 
to learn, where the dictionary is applied to new tasks of the same environment. 

Our work combines ideas from sparse coding [25, 26], multitask learning [1, 3, 10, 13, 14]
and learning to learn [7, 30]. There is a vast literature on these subjects and the list of papers
provided here is necessarily incomplete. Learning to learn (also called inductive bias learning
or transfer learning) has been proposed by Baxter [7] and an error analysis is provided therein,
showing that a common representation which performs well on the training tasks will also
generalize to new tasks obtained from the same "environment". The precursors of the analysis
presented here are [23] and [22]. The first paper provides a bound on the reconstruction error
of sparse coding and may be seen as a special case of the ideas presented here in the case of
infinite sample size. The second paper provides a learning to learn analysis of the multitask
feature learning method in [3]. There are other works such as [19, 27] which have explored the
application of sparse coding for supervised learning. The main idea pursued in those papers is to
learn a dictionary from the input data and at the same time use the coding vectors
as features for a supervised learning algorithm. Such features are a non-linear transformation
of the input and the tasks are assumed to be related because they share the same dictionary used
to produce this representation. In our approach, we seek a dictionary which represents well
the tasks' regression vectors, assuming that these are a sparse combination of the dictionary
elements. In other words, the features learned by our method are a linear transformation of the
input data and sparsity is enforced on the tasks' regression coefficients. We note that at the time
of writing a method very similar to ours has been proposed for multitask learning [17].
Here we present a probabilistic analysis which complements the practical insights
in [17], highlight the connection to sparse coding [25] and address the different problem of
learning to learn.

The paper is organized in the following manner. In Section 2 we set up our notation and
introduce the learning problem. In Section 3 we present our learning bounds for multitask
learning and learning to learn. In Section 4 we report on numerical experiments. Section 5
contains concluding remarks.

2 Method



In this section, we turn to a technical exposition of the proposed method, introducing some 
necessary notation on the way. 

Let $H$ be a finite or infinite dimensional Hilbert space with inner product
$\langle \cdot, \cdot \rangle$, norm $\|\cdot\|$, and fix an integer $K$. We study the problem

$$\min_{D \in \mathcal{D}_K} \frac{1}{T} \sum_{t=1}^{T} \min_{\gamma \in \mathcal{C}_\alpha} \frac{1}{m} \sum_{i=1}^{m} \ell\left(\langle D\gamma, x_{ti} \rangle, y_{ti}\right), \qquad (2.1)$$

where

• $\mathcal{D}_K$ is the set of $K$-dictionaries (or simply dictionaries), which means that every
$D \in \mathcal{D}_K$ is a linear map $D : \mathbb{R}^K \to H$, such that $\|De_k\| \le 1$ for every one of the
canonical basis vectors $e_k$ of $\mathbb{R}^K$. The number $K$ can be regarded as one of the
regularization parameters of our method.

• $\mathcal{C}_\alpha$ is the set of vectors $\gamma$ in $\mathbb{R}^K$ satisfying $\|\gamma\|_1 \le \alpha$. The $\ell_1$-norm constraint
implements the assumption of sparsity and $\alpha$ is the other regularization parameter. Different
sets $\mathcal{C}_\alpha$ could be readily used in our method, such as those associated with $\ell_p$-norms or
mixed norms, see e.g. [15].

• $\mathbf{Z} = ((x_{ti}, y_{ti}) : 1 \le i \le m,\ 1 \le t \le T)$ is a dataset on which our algorithm operates.
Each $x_{ti} \in H$ represents an input vector, and $y_{ti}$ is a corresponding real valued label.
We also write $\mathbf{Z} = (\mathbf{X}, \mathbf{Y}) = (\mathbf{z}_1, \ldots, \mathbf{z}_T) = ((\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_T, \mathbf{y}_T))$ with
$\mathbf{x}_t = (x_{t1}, \ldots, x_{tm})$ and $\mathbf{y}_t = (y_{t1}, \ldots, y_{tm})$. The index $t$ identifies a learning task,
and $\mathbf{z}_t$ are the corresponding training points, so the algorithm operates on $T$ tasks, each of
which is represented by $m$ example pairs.

• $\ell$ is a loss function where $\ell(y, y')$ measures the loss incurred by predicting $y$ when the
true label is $y'$. We assume that $\ell$ has values in $[0, 1]$ and has Lipschitz constant $L$ in the
first argument for all values of the second argument.

The minimum in (2.1) is zero if the data is generated according to a noiseless model which
postulates that there is a "true" dictionary $D^* \in \mathcal{D}_{K^*}$ with $K^*$ atoms and vectors
$\gamma_1^*, \ldots, \gamma_T^*$ satisfying $\|\gamma_t^*\|_1 \le \alpha^*$, such that an input $x \in H$ generates the label
$y = \langle D^*\gamma_t^*, x \rangle$ in the context of task $t$. If $K \ge K^*$ and $\alpha \ge \alpha^*$ then the minimum in
(2.1) is zero. In Section 4 we will present experiments with such a generative model, when noise
is added to the labels, that is $y = \langle D^*\gamma_t^*, x \rangle + \zeta$ with $\zeta \sim \mathcal{N}(0, \sigma^2)$, a zero-mean
normal distribution.

The method (2.1) should output a minimizing dictionary $D(\mathbf{Z}) \in \mathcal{D}_K$ as well as minimizing
codes $\gamma_1(\mathbf{Z}), \ldots, \gamma_T(\mathbf{Z})$ corresponding to the different tasks. Our implementation, described
below, does not guarantee exact minimization, because of the non-convexity of the problem.
Below predictors are always linear, specified by a vector $w \in H$, predicting the label
$\langle w, x \rangle$ for an input $x \in H$, and a learning algorithm is a rule which assigns a predictor
$A(\mathbf{z})$ to a given data set $\mathbf{z} = ((x_i, y_i) : 1 \le i \le m) \in (H \times \mathbb{R})^m$.

We note that a method similar to (2.1) has been proposed in [17], where the Frobenius norm
on the dictionary is used in place of the $\ell_2/\ell_\infty$-norm employed here.
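To make the structure of problem (2.1) concrete, the following minimal sketch evaluates its empirical objective for a fixed dictionary and fixed codes and checks the two constraints. It does not perform the minimization and is not the implementation used in the experiments. The function names, the truncated square loss and the toy data are assumptions introduced only for illustration.

```python
import math

def inner(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def predictor(atoms, gamma):
    """Return w = D gamma, where atoms[k] is the atom D e_k (a list of d floats)."""
    d = len(atoms[0])
    return [sum(g * atom[j] for g, atom in zip(gamma, atoms)) for j in range(d)]

def is_feasible(atoms, codes, alpha, tol=1e-12):
    """Check the two constraints: ||D e_k|| <= 1 and ||gamma_t||_1 <= alpha."""
    atoms_ok = all(math.sqrt(inner(a, a)) <= 1.0 + tol for a in atoms)
    codes_ok = all(sum(abs(g) for g in gamma) <= alpha + tol for gamma in codes)
    return atoms_ok and codes_ok

def objective(atoms, codes, X, Y, loss):
    """(1/T) sum_t (1/m) sum_i loss(<D gamma_t, x_ti>, y_ti) for FIXED codes."""
    total = 0.0
    for gamma, xs, ys in zip(codes, X, Y):
        w = predictor(atoms, gamma)  # task predictor w_t = D gamma_t
        total += sum(loss(inner(w, x), y) for x, y in zip(xs, ys)) / len(xs)
    return total / len(codes)

def clipped_square(y_pred, y_true):
    """Square loss truncated at 1 so that it takes values in [0, 1] (illustrative choice)."""
    return min(1.0, (y_pred - y_true) ** 2)

# Hypothetical toy data: K = 2 atoms in R^2, T = 2 tasks, m = 2 examples per task.
atoms = [[1.0, 0.0], [0.0, 1.0]]
codes = [[0.5, 0.0], [0.0, -0.5]]
X = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
Y_noiseless = [[0.5, 0.0], [0.0, -0.5]]  # labels y_ti = <D gamma_t, x_ti>
Y_perturbed = [[0.5, 1.0], [0.0, -0.5]]  # one label moved by 1

print(is_feasible(atoms, codes, alpha=1.0))
print(objective(atoms, codes, X, Y_noiseless, clipped_square))
print(objective(atoms, codes, X, Y_perturbed, clipped_square))
```

With labels generated without noise from the same dictionary and codes the objective vanishes, in line with the remark on the noiseless model. Perturbing a single label makes it strictly positive.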
3 Learning bounds



In this section, we present learning bounds for method (2.1), both in the multitask learning and
learning to learn settings, and discuss the special case of sparse coding.

3.1 Multitask learning 

Let $\mu_1, \ldots, \mu_T$ be probability measures on $H \times \mathbb{R}$. We interpret $\mu_t(x, y)$ as the
probability of observing the input/output pair $(x, y)$ in the context of task $t$. For each of these
tasks an i.i.d. training sample $\mathbf{z}_t = ((x_{ti}, y_{ti}) : 1 \le i \le m)$ is drawn from $(\mu_t)^m$ and the
ensemble $\mathbf{Z} \sim \prod_{t=1}^{T} \mu_t^m$ is input to algorithm (2.1). Upon returning of a minimizing
$D(\mathbf{Z})$ and $\gamma_1(\mathbf{Z}), \ldots, \gamma_T(\mathbf{Z})$, we will use the predictor $D(\mathbf{Z})\gamma_t(\mathbf{Z})$ on the $t$-th
task. The average over all tasks of the expected error incurred by these predictors is

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} \left[\ell\left(\langle D(\mathbf{Z})\gamma_t(\mathbf{Z}), x \rangle, y\right)\right].$$

We compare this task-average risk to the minimal analogous risk obtainable by any dictionary
$D \in \mathcal{D}_K$ and any set of vectors $\gamma_1, \ldots, \gamma_T \in \mathcal{C}_\alpha$. Our first result is a bound on the
excess risk.

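Theorem 1. Let $\delta > 0$ and let $\mu_1, \ldots, \mu_T$ be probability measures on $H \times \mathbb{R}$. With
probability at least $1 - \delta$ in the draw of $\mathbf{Z} \sim \prod_{t=1}^{T} \mu_t^m$ we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x,y) \sim \mu_t} \left[\ell\left(\langle D(\mathbf{Z})\gamma_t(\mathbf{Z}), x \rangle, y\right)\right] - \inf_{D \in \mathcal{D}_K} \frac{1}{T} \sum_{t=1}^{T} \inf_{\gamma \in \mathcal{C}_\alpha} \mathbb{E}_{(x,y) \sim \mu_t} \left[\ell\left(\langle D\gamma, x \rangle, y\right)\right]$$

$$\le L\alpha \sqrt{\frac{2 S_1(\mathbf{X}) (K + 12)}{mT}} + L\alpha \sqrt{\frac{8 S_\infty(\mathbf{X}) \ln(2K)}{m}} + \sqrt{\frac{8 \ln(4/\delta)}{mT}},$$

where $S_1(\mathbf{X}) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{tr}(\hat{\Sigma}(\mathbf{x}_t))$ and
$S_\infty(\mathbf{X}) = \frac{1}{T} \sum_{t=1}^{T} \lambda_{\max}(\hat{\Sigma}(\mathbf{x}_t))$. Here $\hat{\Sigma}(\mathbf{x}_t)$ is the
empirical covariance of the input data for the $t$-th task, $\mathrm{tr}(\cdot)$ denotes the trace and
$\lambda_{\max}(\cdot)$ the largest eigenvalue.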
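The two data-dependent quantities in the bound can be computed directly from the inputs. The sketch below is a dependency-free illustration under the assumption that $S_1$ averages the traces and $S_\infty$ averages the largest eigenvalues of the per-task empirical covariances. The power iteration on the Gram matrix and the toy inputs are choices made here for illustration only.

```python
import math
import random

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def trace_cov(xs):
    """Trace of the empirical covariance: the average squared norm of the inputs."""
    return sum(inner(x, x) for x in xs) / len(xs)

def lambda_max_cov(xs, iters=200, seed=0):
    """Largest eigenvalue of the empirical covariance of xs.

    Uses power iteration on the m x m Gram matrix G_ij = <x_i, x_j> / m,
    which has the same nonzero eigenvalues as the covariance operator.
    """
    m = len(xs)
    gram = [[inner(xi, xj) / m for xj in xs] for xi in xs]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(iters):
        gv = [inner(row, v) for row in gram]
        norm = math.sqrt(inner(gv, gv))
        if norm == 0.0:  # zero Gram matrix, or a start vector in its null space
            return 0.0
        v = [a / norm for a in gv]
    return inner(v, [inner(row, v) for row in gram])

def S1(X):
    """Average over tasks of the trace of the empirical covariance."""
    return sum(trace_cov(xs) for xs in X) / len(X)

def S_inf(X):
    """Average over tasks of the largest eigenvalue of the empirical covariance."""
    return sum(lambda_max_cov(xs) for xs in X) / len(X)

# Hypothetical inputs: one task with one-dimensional data, one with orthonormal data.
task_line = [[1.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
task_spread = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(S1([task_line]), S_inf([task_line]))      # numerically equal: one-dimensional data
print(S1([task_spread]), S_inf([task_spread]))  # S_inf is S1 / m for orthonormal inputs
```

The first task shows the extreme case in which both quantities coincide. The second shows inputs that spread over as many directions as there are samples, where $S_\infty$ is smaller than $S_1$ by the factor $m$.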

We state several implications of this theorem. 

1. The quantity $S_1(\mathbf{X})$ appearing in the bound is just the average square norm of the input
data points, while $S_\infty(\mathbf{X})$ is roughly the average inverse of the observed dimension of
the data for each task. Suppose that $H = \mathbb{R}^d$ and that the data-distribution is uniform
on the surface of the unit ball. Then $S_1(\mathbf{X}) = 1$ and for $m \ll d$ it follows from Levy's
isoperimetric inequality (see e.g. [18]) that $S_\infty(\mathbf{X}) \approx 1/m$, so the corresponding term
behaves like $\sqrt{\ln K}/m$. If the minimum in (2.1) is small and $T$ is large enough for
this term to become dominant then there is a significant advantage of the method over
learning the tasks independently. If the data is essentially low dimensional, then $S_\infty(\mathbf{X})$
will be large, and in the extreme case, if the data is one-dimensional for all tasks then
$S_\infty(\mathbf{X}) = S_1(\mathbf{X})$ and our bound will always be worse by a factor of $\ln K$ than standard
bounds for independent single task learning as in [6]. This makes sense, because for low
dimensional data there can be little advantage to multi-task learning.

2. In the regime $T < K$ the bound is dominated by the term of order
$\alpha\sqrt{S_1(\mathbf{X}) K/(mT)} > \alpha\sqrt{S_1(\mathbf{X})/m}$. This is easy to understand, because the dictionary
atoms $De_k$ can be chosen independently, separately for each task, so we could at best recover
the usual bound for linear models and there is no benefit from multi-task learning.

3. Consider the noiseless generative model mentioned in Section 2. If $K \ge K^*$ and
$\alpha \ge \alpha^*$ then the minimum in (2.1) is zero. In the bound the overestimation of $K^*$ can be
compensated by a proportional increase in the number of tasks considered and an only
very minor increase of the sample size $m$, namely $m \to (\ln K / \ln K^*)\, m$.

4. Suppose that we concatenate two sets of tasks. If the tasks are generated by the generative
model described in Section 2 then the resulting set of tasks is also generated by such a
model, obtained by concatenating the lists of atoms of the two true dictionaries $D_1^*$ and
$D_2^*$ to obtain the new dictionary $D^*$ of length $K^* = K_1^* + K_2^*$ and taking the union of
the sets of generating vectors $\{\gamma_t^{*1}\}_{t=1}^{T_1}$ and $\{\gamma_t^{*2}\}_{t=1}^{T_2}$, extending them to
$\mathbb{R}^{K_1^* + K_2^*}$ so that the supports of the first group are disjoint from the supports of the
second group. If $T_1 = T_2$, $K_1^* = K_2^*$ and we train with the correct parameters, then the
excess risk for the total task set increases only by the order of $1/\sqrt{m}$, independent of $K$,
despite the fact that the tasks in the second group are in no way related to those in the first
group. This is directly related to avoiding negative transfer. Negative transfer happens in
situations when we attempt to "transfer knowledge" between unrelated tasks, thereby decreasing
the statistical performance. The bound in Theorem 1 suggests that our method avoids
negative transfer by implicitly finding the right clusters of mutually related tasks.

5. Consider the alternative method of subspace learning (SL) where $\mathcal{C}_\alpha$ is replaced by a
Euclidean ball of radius $\alpha$. With similar methods one can prove a bound for SL where,
apart from slightly different constants, $\sqrt{\ln K}$ above is replaced by $\sqrt{K}$. SL will be
successful and outperform the proposed method whenever $K$ can be chosen small, with
$K < m$, and the vectors $\gamma_t^*$ utilize the entire span of the dictionary. For large values of
$K$, a correspondingly large number of tasks and sparse $\gamma_t^*$ the proposed method will be
superior.
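The role of the individual terms discussed in these remarks can be explored numerically. The sketch below assumes that the right-hand side of Theorem 1 has the form $L\alpha\sqrt{2S_1(K+12)/(mT)} + L\alpha\sqrt{8S_\infty\ln(2K)/m} + \sqrt{8\ln(4/\delta)/(mT)}$. The function name and the parameter values are hypothetical.

```python
import math

def theorem1_terms(L, alpha, K, m, T, S1, S_inf, delta):
    """Three terms of the excess-risk bound, in the form assumed in the text above."""
    dictionary_term = L * alpha * math.sqrt(2.0 * S1 * (K + 12) / (m * T))
    code_term = L * alpha * math.sqrt(8.0 * S_inf * math.log(2 * K) / m)
    confidence_term = math.sqrt(8.0 * math.log(4.0 / delta) / (m * T))
    return dictionary_term, code_term, confidence_term

# Hypothetical setting: unit-sphere inputs with m << d, so S1 = 1 and S_inf is about 1/m.
m, K = 10, 100
for T in (10, 100, 1000, 10000):
    terms = theorem1_terms(L=1.0, alpha=1.0, K=K, m=m, T=T,
                           S1=1.0, S_inf=1.0 / m, delta=0.05)
    print(T, [round(t, 4) for t in terms])
```

Only the first and third terms shrink as the number of tasks grows. The second term, which reflects the cost of estimating the codes for each task, depends on $m$ and $K$ alone.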

The proof of Theorem 1, which is given in Section B.1 of the supplementary appendix, uses
standard methods of empirical process theory, but also employs a concentration result related
to Talagrand's convex distance inequality to obtain the crucial dependence on $S_\infty(\mathbf{X})$. At the
end of Section B.1 we sketch applications of the proof method to other regularization schemes,
such as the one presented in [17].

3.2 Learning to learn 

There is no absolute way to assess the quality of a learning algorithm. Algorithms may perform
well on one kind of task, but poorly on another kind. It is important that an algorithm performs
well on those tasks which it is likely to be applied to. To formalize this, Baxter introduced
the notion of an environment, which is a probability measure $\mathcal{E}$ on the set of tasks. Thus
$\mathcal{E}(\tau)$ is the probability of encountering the task $\tau$ in the environment $\mathcal{E}$, and
$\mu_\tau(x, y)$ is the probability of finding the pair $(x, y)$ in the context of the task $\tau$.

Given $\mathcal{E}$ the transfer risk (or simply risk) of a learning algorithm $A$ is defined as follows.
We draw a task from the environment, $\tau \sim \mathcal{E}$, which fixes a corresponding distribution
$\mu_\tau$ on $H \times \mathbb{R}$. Then we draw a training sample $\mathbf{z} \sim \mu_\tau^m$ and use the algorithm to
compute the predictor $A(\mathbf{z})$. Finally we measure the performance of this predictor on test
points $(x, y) \sim \mu_\tau$. The corresponding definition of the transfer risk of $A$ reads as

$$R_{\mathcal{E}}(A) = \mathbb{E}_{\tau \sim \mathcal{E}}\, \mathbb{E}_{\mathbf{z} \sim \mu_\tau^m}\, \mathbb{E}_{(x,y) \sim \mu_\tau} \left[\ell\left(\langle A(\mathbf{z}), x \rangle, y\right)\right],$$

which is simply the expected loss incurred by the use of the algorithm $A$ on tasks drawn from
the environment $\mathcal{E}$.

For any given dictionary $D \in \mathcal{D}_K$ we consider the learning algorithm $A_D$, which for
$\mathbf{z} \in \mathcal{Z}^m$ computes the predictor

$$A_D(\mathbf{z}) = D \arg\min_{\gamma \in \mathcal{C}_\alpha} \frac{1}{m} \sum_{i=1}^{m} \ell\left(\langle D\gamma, x_i \rangle, y_i\right).$$

Equivalently, we can regard $A_D$ as the Lasso operating on data preprocessed by the linear map
$D^\top$, the adjoint of $D$.
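The equivalence with the Lasso rests on the identity $\langle D\gamma, x \rangle = \langle \gamma, D^\top x \rangle$: the codes are fitted to the $K$-dimensional features $D^\top x_i$. The following minimal sketch illustrates this for the square loss, using projected gradient descent with the standard sort-based Euclidean projection onto the $\ell_1$-ball. The solver, the step size rule and the toy data are assumptions made for illustration and not necessarily those used by the authors.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {g : ||g||_1 <= radius} (sort-based method)."""
    if sum(abs(a) for a in v) <= radius:
        return list(v)
    u = sorted((abs(a) for a in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - radius) / j
        if uj > t:
            theta = t  # the last index satisfying the test gives the soft threshold
    return [math.copysign(max(abs(a) - theta, 0.0), a) for a in v]

def fit_code(features, ys, alpha, steps=500):
    """Projected gradient for min over ||g||_1 <= alpha of (1/m) sum_i (<g, f_i> - y_i)^2."""
    m, K = len(features), len(features[0])
    # (2/m) * sum_i ||f_i||^2 upper-bounds the Lipschitz constant of the gradient.
    lip = 2.0 * sum(inner(f, f) for f in features) / m
    step = 1.0 / lip if lip > 0.0 else 1.0
    g = [0.0] * K
    for _ in range(steps):
        residuals = [inner(g, f) - y for f, y in zip(features, ys)]
        grad = [2.0 * sum(r * f[k] for r, f in zip(residuals, features)) / m
                for k in range(K)]
        g = project_l1_ball([gk - step * dk for gk, dk in zip(g, grad)], alpha)
    return g

def algorithm_A_D(atoms, xs, ys, alpha):
    """Fit the code on the features D^T x_i, then return the predictor D gamma."""
    features = [[inner(atom, x) for atom in atoms] for x in xs]
    gamma = fit_code(features, ys, alpha)
    d = len(atoms[0])
    return [sum(gamma[k] * atoms[k][j] for k in range(len(atoms))) for j in range(d)]

# Hypothetical noiseless task: atoms are the canonical basis of R^3.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ys = [0.5, 0.0, -0.25]  # generated by the code (0.5, 0, -0.25)
w_hat = algorithm_A_D(atoms, xs, ys, alpha=1.0)
print(w_hat)
```

When the generating code lies inside the $\ell_1$-ball, as in this toy task, the constraint is inactive and the predictor recovers the generating vector. With a smaller radius the solution is shrunk towards a sparser code.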

We can make a single observation of the environment $\mathcal{E}$ in the following way: one first
draws a task $\tau \sim \mathcal{E}$. This task and the corresponding distribution $\mu_\tau$ are then observed
by drawing an i.i.d. sample $\mathbf{z}$ from $\mu_\tau$, that is $\mathbf{z} \sim \mu_\tau^m$. For simplicity the sample
size $m$ will be fixed. Such an observation corresponds to the draw of a sample $\mathbf{z}$ from a
probability distribution $\rho_{\mathcal{E}}$ on $(H \times \mathbb{R})^m$ which is defined by

$$\rho_{\mathcal{E}}(\mathbf{z}) := \mathbb{E}_{\tau \sim \mathcal{E}} \left[(\mu_\tau)^m(\mathbf{z})\right].$$

To estimate an environment a large number $T$ of independent observations is needed,
corresponding to a vector $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_T) \in ((H \times \mathbb{R})^m)^T$ drawn i.i.d. from
$\rho_{\mathcal{E}}$, that is $\mathbf{Z} \sim (\rho_{\mathcal{E}})^T$. We now propose to solve the problem (2.1) with the data
$\mathbf{Z}$, ignore the resulting $\gamma_t(\mathbf{Z})$, but retain the dictionary $D(\mathbf{Z})$ and use the algorithm
$A_{D(\mathbf{Z})}$ on future tasks drawn from the same environment. The performance of this method can be
quantified as the transfer risk $R_{\mathcal{E}}(A_{D(\mathbf{Z})})$ as defined above in (3.2) and again we are
interested in comparing this to the risk of an ideal solution based on complete knowledge of the
environment. For any fixed dictionary $D$ and task $\tau$ the best we can do is to choose
$\gamma \in \mathcal{C}_\alpha$ so as to minimize $\mathbb{E}_{(x,y) \sim \mu_\tau}\left[\ell(\langle D\gamma, x \rangle, y)\right]$, so the best
is to choose $D$ so as to minimize the average of this over $\tau \sim \mathcal{E}$. The quantity

$$R_{\mathrm{opt}} = \min_{D \in \mathcal{D}_K} \mathbb{E}_{\tau \sim \mathcal{E}} \min_{\gamma \in \mathcal{C}_\alpha} \mathbb{E}_{(x,y) \sim \mu_\tau} \left[\ell\left(\langle D\gamma, x \rangle, y\right)\right]$$

thus describes the optimal performance achievable under the given constraint. Our second result
is

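Theorem 2. With probability at least $1 - \delta$ in the multisample
$\mathbf{Z} = (\mathbf{X}, \mathbf{Y}) \sim \rho_{\mathcal{E}}^T$ we have

$$R_{\mathcal{E}}(A_{D(\mathbf{Z})}) - R_{\mathrm{opt}} \le L\alpha K \sqrt{\frac{2\pi S_1(\mathbf{X})}{T}} + 4L\alpha \sqrt{\frac{S_\infty(\mathcal{E}) (2 + \ln K)}{m}} + \sqrt{\frac{8 \ln(4/\delta)}{T}},$$

where $S_1(\mathbf{X})$ is as in Theorem 1 and
$S_\infty(\mathcal{E}) := \mathbb{E}_{\tau \sim \mathcal{E}}\, \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mu_\tau^m}\, \lambda_{\max}(\hat{\Sigma}(\mathbf{x}))$.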

We discuss some implications of the above theorem. Some of these are analogous to the
remarks following Theorem 1.

1. The interpretation of $S_\infty(\mathcal{E})$ is analogous to that of $S_\infty(\mathbf{X})$ in the bound for
Theorem 1. The same applies to Remark 6 following Theorem 1.

2. In the regime $T < K^2$ the result does not imply any useful behavior. On the other hand, if
$T \gg K^2$ the dominant term in the bound is of order $\alpha\sqrt{S_\infty(\mathcal{E})/m}$.

3. There is an important difference with the multitask learning bound, namely in Theorem
2 we have $\sqrt{T}$ in the denominator of the first term of the excess risk, and not $\sqrt{mT}$ as in
Theorem 1. This is because in the setting of learning to learn there is always a possibility
of being misled by the draw of the training tasks. This possibility can only decrease as $T$
increases; increasing $m$ does not help.

The proof of Theorem 2 is given in Section B.2 of the supplementary appendix and follows
the method outlined in [22]: one first bounds the estimation error for the expected empirical risk
on future tasks, and then combines this with a bound of the expected true risk by said expected
empirical risk. The term $K/\sqrt{T}$ may be an artifact of our method of proof and the conjecture
that it can be replaced by $\sqrt{K/T}$ seems plausible.
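The practical consequence of the $K/\sqrt{T}$ dependence can be quantified with a small calculation. The sketch below assumes that the bound of Theorem 2 has the form $L\alpha K\sqrt{2\pi S_1/T} + 4L\alpha\sqrt{S_\infty(\mathcal{E})(2+\ln K)/m} + \sqrt{8\ln(4/\delta)/T}$. The helper `tasks_needed` is a hypothetical convenience function that inverts the first term.

```python
import math

def theorem2_terms(L, alpha, K, m, T, S1, S_inf_env, delta):
    """Three terms of the transfer-risk bound, in the form assumed in the text above."""
    dictionary_term = L * alpha * K * math.sqrt(2.0 * math.pi * S1 / T)
    code_term = 4.0 * L * alpha * math.sqrt(S_inf_env * (2.0 + math.log(K)) / m)
    confidence_term = math.sqrt(8.0 * math.log(4.0 / delta) / T)
    return dictionary_term, code_term, confidence_term

def tasks_needed(L, alpha, K, S1, eps):
    """Smallest number of training tasks T for which the dictionary term is at most eps."""
    return math.ceil(2.0 * math.pi * S1 * (L * alpha * K / eps) ** 2)

# Hypothetical values: the required number of tasks grows quadratically with K.
for K in (5, 10, 20, 40):
    print(K, tasks_needed(L=1.0, alpha=1.0, K=K, S1=1.0, eps=0.5))
```

Doubling $K$ roughly quadruples the number of tasks needed to control the first term, which is the regime $T \gg K^2$ mentioned in the remarks. Under the conjectured $\sqrt{K/T}$ rate the growth would only be linear in $K$.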



3.3 Connection to sparse coding 

We discuss a special case of Theorem 2 in the limit $m \to \infty$, showing that it subsumes the
sparse coding result in [23]. To this end, we assume the noiseless generative model
$y_{ti} = \langle w_t, x_{ti} \rangle$ described in Section 2, that is $\mu(x, y) = p(x)\,\delta(y, \langle w, x \rangle)$, where $p$
is the uniform distribution on the sphere in $\mathbb{R}^d$ (i.e. the Haar measure). In this case the
environment of tasks is fully specified by a measure $\rho$ on the unit ball in $\mathbb{R}^d$ from which a
task $w \in \mathbb{R}^d$ is drawn and the measure $\mu$ is identified with the vector $w$. Note that we do not
assume that these tasks are obtained as sparse combinations of some dictionary. Under the above
assumptions and choosing $\ell$ to be the square loss, we have that
$\mathbb{E}_{(x,y) \sim \mu_t}\, \ell(\langle w, x \rangle, y) = \|w_t - w\|^2$. Consequently, in the limit of $m \to \infty$
method (2.1) reduces to a constrained version of sparse coding [25, 26], namely

$$\min_{D \in \mathcal{D}_K} \frac{1}{T} \sum_{t=1}^{T} \min_{\gamma \in \mathcal{C}_\alpha} \|D\gamma - w_t\|^2.$$

In turn, the transfer error of a dictionary $D$ is given by the quantity
$R(D) := \mathbb{E}_{w \sim \rho} \min_{\gamma \in \mathcal{C}_\alpha} \|D\gamma - w\|^2$ and
$R_{\mathrm{opt}} = \min_{D \in \mathcal{D}_K} \mathbb{E}_{w \sim \rho} \min_{\gamma \in \mathcal{C}_\alpha} \|D\gamma - w\|^2$. Given the constraints
$D \in \mathcal{D}_K$, $\gamma \in \mathcal{C}_\alpha$ and $\|x\| \le 1$, the square loss $\ell(y, y') = (y - y')^2$, evaluated at
$y = \langle D\gamma, x \rangle$, can be restricted to the interval $y \in [-\alpha, \alpha]$, where it has the Lipschitz
constant $2(1 + \alpha)$ for any $y' \in [-1, 1]$, as is easily verified. Since $S_1(\mathbf{X}) = 1$ and
$S_\infty(\mathcal{E}) < \infty$, the bound in Theorem 2 becomes

$$R(D) - R_{\mathrm{opt}} \le 2\alpha(1 + \alpha) K \sqrt{\frac{2\pi}{T}} + \sqrt{\frac{8 \ln(4/\delta)}{T}} \qquad (3.1)$$

in the limit $m \to \infty$. The typical choice for $\alpha$ is $\alpha \le 1$, which ensures that
$\|D\gamma\| \le 1$. In this case inequality (3.1) provides an improvement over the sparse coding bound
in [23] (cf. Theorem 2 and Section 2.4 therein), which contains an additional term of the order of
$\sqrt{(\ln T)/T}$ and the same leading term in $K$ as in (3.1) but with a slightly worse constant
(14 instead of $4\sqrt{2\pi}$). The connection of our method to sparse coding is experimentally
demonstrated in Section 4.3 and illustrated in Figure 5.
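The Lipschitz constant used in this argument can be verified in two lines. The derivation below is a sketch under the stated constraints ($\|De_k\| \le 1$, $\|\gamma\|_1 \le \alpha$, $\|x\| \le 1$, $|y'| \le 1$). The last display assumes that the first term of the bound in Theorem 2 has the form $L\alpha K\sqrt{2\pi S_1(\mathbf{X})/T}$.

```latex
% Range of the prediction: for D in D_K, gamma in C_alpha and ||x|| <= 1,
|\langle D\gamma, x\rangle| \le \|D\gamma\|\,\|x\|
  \le \sum_{k=1}^{K} |\gamma_k|\,\|De_k\| \le \alpha .

% Lipschitz constant of the square loss in its first argument:
% for y, \tilde y \in [-\alpha, \alpha] and y' \in [-1, 1],
\left| (y - y')^2 - (\tilde y - y')^2 \right|
  = |y - \tilde y| \, |y + \tilde y - 2y'|
  \le |y - \tilde y| \, (2\alpha + 2)
  = 2(1+\alpha)\,|y - \tilde y| .

% Substituting L = 2(1+\alpha) and S_1(X) = 1 into L \alpha K \sqrt{2\pi S_1(X)/T}:
2\alpha(1+\alpha)\,K\sqrt{\frac{2\pi}{T}} ,
\qquad\text{which for } \alpha \le 1 \text{ is at most } 4\sqrt{2\pi}\,\frac{K}{\sqrt{T}} .
```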



4 Experiments 

In this section, we present experiments on a synthetic and a real dataset. The aim of the
experiments is to study the statistical performance of the proposed method, in both settings of
multitask learning and learning to learn. We compare our method, denoted as Sparse Coding
Multi Task Learning (SC-MTL), with single task learning (independent ridge regression, RR)
as a baseline and multitask feature learning (MTFL) [3]. We also report on a sensitivity analysis
of the proposed method with respect to the different parameters involved.



4.1 Optimization algorithm 



We solve problem (2.1) by alternating minimization over the dictionary matrix $D$ and the code
vectors $\gamma$. The techniques we use are very similar to standard methods for sparse coding and
dictionary learning, see [15] and references therein for more information. Briefly, assuming
that the loss function $\ell$ is convex and has Lipschitz continuous gradient, either minimization
problem is convex and can be solved efficiently by proximal gradient methods, e.g. (HI HL 
The key ingredient in each step is the computation of the proximity operator, which in either 
problem has a closed form expression. 
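As an illustration of the dictionary half of the alternating scheme, the sketch below performs one projected-gradient step on the atoms for the square loss with the codes held fixed. For the constraint $\|De_k\| \le 1$ the closed-form operation is a rescaling of each atom onto the unit ball. The function names, the choice of square loss and the fixed step size are assumptions made here. The code step is an $\ell_1$-constrained problem of Lasso type and is not shown.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_unit_ball(atom):
    """Closed-form projection onto {a : ||a|| <= 1}: rescale only if the norm exceeds 1."""
    norm = math.sqrt(inner(atom, atom))
    return list(atom) if norm <= 1.0 else [a / norm for a in atom]

def dictionary_step(atoms, codes, X, Y, step):
    """One projected-gradient step on the atoms for the square loss, codes held fixed."""
    K, d, T = len(atoms), len(atoms[0]), len(codes)
    grad = [[0.0] * d for _ in range(K)]
    for gamma, xs, ys in zip(codes, X, Y):
        w = [sum(gamma[k] * atoms[k][j] for k in range(K)) for j in range(d)]
        for x, y in zip(xs, ys):
            # Derivative of (<w, x> - y)^2 w.r.t. the prediction, with the 1/(mT) averaging.
            r = 2.0 * (inner(w, x) - y) / (len(xs) * T)
            for k in range(K):
                for j in range(d):
                    grad[k][j] += r * gamma[k] * x[j]
    return [project_unit_ball([atoms[k][j] - step * grad[k][j] for j in range(d)])
            for k in range(K)]

# Hypothetical toy problem: one atom in R^2, one task, one example.
atoms = [[0.0, 0.0]]
codes = [[1.0]]
X, Y = [[[1.0, 0.0]]], [[1.0]]
small = dictionary_step(atoms, codes, X, Y, step=0.25)  # stays inside the unit ball
large = dictionary_step(atoms, codes, X, Y, step=1.0)   # overshoots, then is projected
print(small, large)
```

In a full solver this step would alternate with the code update until the objective stops decreasing. Because the joint problem is non-convex, such a scheme can only be expected to reach a stationary point.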



4.2 Toy experiment

Figure 1: Multitask error (Left) and Transfer error (Right) vs. number of training tasks T.

Figure 2: Multitask error (Left) and Transfer error (Right) vs. number of atoms K' used by
our method.

Figure 3: Multitask error (Left) and Transfer error (Right) vs. sparsity ratio s/K.



We generated a synthetic environment of tasks as follows. We choose a $d \times K$ matrix $D$
by sampling its columns independently from the uniform distribution on the unit sphere in
$\mathbb{R}^d$. Once $D$ is created, a generic task in the environment is given by $w = D\gamma$, where
$\gamma$ is an $s$-sparse vector obtained as follows. First, we generate a set $J \subseteq \{1, \ldots, K\}$ of
cardinality $s$, whose elements (indices) are sampled uniformly without replacement from the set
$\{1, \ldots, K\}$. We then set $\gamma_j = 0$ if $j \notin J$ and otherwise sample $\gamma_j \sim \mathcal{N}(0, 0.1)$.
Finally, we normalize $\gamma$ so that it has $\ell_1$-norm equal to some prescribed value $\alpha$. Using
the above procedure we generated $T$ tasks $w_t = D\gamma_t$, $t = 1, \ldots, T$. Further, for each task $t$
we generated a training set $\mathbf{z}_t = \{(x_{ti}, y_{ti})\}_{i=1}^{m}$, sampling $x_{ti}$ i.i.d. from the
uniform distribution on the unit sphere in $\mathbb{R}^d$. We then set
$y_{ti} = \langle w_t, x_{ti} \rangle + \zeta_{ti}$, with $\zeta_{ti} \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance of
the noise. This procedure also defines the generation of new tasks in the transfer learning
experiments below. We note that since the input distribution is uniform on a high dimensional
sphere, neither sparse coding nor PCA will produce a useful representation, so there is no point
in comparing to these methods here.
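The generative procedure can be sketched in a few lines of dependency-free Python. This is an illustration rather than the code used for the reported experiments. In particular, reading $\mathcal{N}(0, 0.1)$ as a normal distribution with variance 0.1 and treating the noise parameter `sigma` as a standard deviation are assumptions, as are all function names.

```python
import math
import random

def unit_sphere_sample(d, rng):
    """Uniform sample from the unit sphere in R^d (normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def make_dictionary(K, d, rng):
    """K atoms (the columns of D), each uniform on the unit sphere in R^d."""
    return [unit_sphere_sample(d, rng) for _ in range(K)]

def sample_code(K, s, alpha, rng):
    """s-sparse code with Gaussian nonzero entries, rescaled to l1-norm alpha."""
    support = rng.sample(range(K), s)  # uniform without replacement
    gamma = [0.0] * K
    for j in support:
        gamma[j] = rng.gauss(0.0, math.sqrt(0.1))  # assumption: 0.1 is the variance
    l1 = sum(abs(g) for g in gamma)
    return [alpha * g / l1 for g in gamma]

def make_task(atoms, s, alpha, m, sigma, rng):
    """Return (w, gamma, xs, ys) for one task drawn from the environment."""
    K, d = len(atoms), len(atoms[0])
    gamma = sample_code(K, s, alpha, rng)
    w = [sum(gamma[k] * atoms[k][j] for k in range(K)) for j in range(d)]
    xs = [unit_sphere_sample(d, rng) for _ in range(m)]
    ys = [sum(a * b for a, b in zip(w, x)) + rng.gauss(0.0, sigma) for x in xs]
    return w, gamma, xs, ys

rng = random.Random(0)
atoms = make_dictionary(K=10, d=20, rng=rng)
tasks = [make_task(atoms, s=2, alpha=10.0, m=10, sigma=0.1, rng=rng) for _ in range(5)]
print(len(tasks), len(tasks[0][2]), len(tasks[0][3]))
```

New tasks for the learning to learn setting are obtained by calling `make_task` again with the same atoms, which mirrors the statement that the same procedure defines the generation of new tasks.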

The above model depends on seven parameters: the number $K$ and the dimension $d$ of the
atoms, the sparsity $s$ and the $\ell_1$-norm $\alpha$ of the codes, the noise level $\sigma$, the sample
size per task $m$ and the number of training tasks $T$. In all experiments we report both the
multitask learning (MTL) and learning to learn (LTL) performance of the methods studied. For MTL,
we measure performance by the estimation error $\frac{1}{T} \sum_{t=1}^{T} \|w_t - \hat{w}_t\|^2$, where
$\hat{w}_1, \ldots, \hat{w}_T$ are the estimated task vectors (in the case of our method
$\hat{w}_t = D(\mathbf{Z})\gamma_t(\mathbf{Z})$, see the discussion in Section 2). For
LTL, we use the same quantity but with a new set of tasks generated by the environment (in the
experiment below we generate 100 new tasks). The regularization parameter of each method
is chosen by cross validation. Finally, all experiments are repeated 50 times, and the average
performance results are reported in the plots below.

In the first experiment, we fix $K = 10$, $d = 20$, $s = 2$, $\alpha = 10$, $m = 10$, $\sigma = 0.1$ and
study the statistical performance of the methods as a function of the number of tasks. The results,
shown in Figure 1, clearly indicate that the proposed method outperforms both ridge regression
and multitask feature learning. In this experiment the number of atoms used by our method,
which here we denote by $K'$ to avoid confusion with the number of atoms $K$ of the target
dictionary, was equal to $K = 10$, which gives an advantage to our method. We therefore also
studied the performance of the method in dependence on $K'$. Figure 2, reporting this result,
is in qualitative agreement with our theoretical analysis: the performance of the method is not
too sensitive to $K'$ if $K' \ge K$, and the method still outperforms independent task learning
and multitask feature learning if $K' = 4K$. On the other hand if $K' < K$ the performance
of the method quickly degrades. In the last experiment we study performance vs. the sparsity
ratio $s/K$. Intuitively we would expect our method to have greater advantage over multitask
feature learning if $s \ll K$. The results, shown in Figure 3, confirm this, also indicating
that our method is outperformed by the multitask feature learning method as sparsity becomes less
pronounced ($s/K > 0.6$).

4.3 Sparse coding of images with missing pixels 

In the next experiment we consider a sparse coding problem [25] of optical character images,
with missing pixels. We employ the Binary Alphadigits dataset¹ which is composed of a set
of binary 20 x 16 images of all digits and capital letters (39 images for each character). In
the following experiment only the digits are used. We regard each image as a task, hence the
input space is the set of 320 possible pixel indices, while the output space is the real interval
[0, 1], representing the gray level. We sample T = 100, 130, 160, 190, 220, 250 images, equally
divided among the 10 possible digits. For each of these, a corresponding random set of m = 160
pixel values is sampled (so the set of sampled pixels varies from one image to another).

We test the performance of the dictionary learned by method (2.1) in a learning to learn
setting, by choosing 100 new images. The regularization parameter for each approach is tuned
using cross validation. The results (Figure 4) indicate some advantage of the proposed method
over trace norm regularization. For a more thorough understanding of the results, let us recall
that MTFL assumes that there is a common representation of all data across the tasks. Therefore,
the lack of a big improvement of MTFL is probably due to the fact that there are 10
different groups of tasks (corresponding to the 10 digits) so that tasks in different groups need a
different representation of the data. This is an instance of negative transfer, already mentioned
in Section 3.1. In contrast, SC-MTL assumes that each task will use only a small subset of the
learned dictionary elements, thereby overcoming this limitation. A similar trend, not reported
here due to space constraints, is obtained in the multitask setting. Ridge regression performed
significantly worse and is not shown in the figure. We also show as a reference the performance
of sparse coding (SC) applied to the T complete images, each of which corresponds to a task in
the MTL formulation. Thus SC can be seen as applying SC-MTL for image completion, when
all pixels are known.

¹ Available at http://www.cs.nyu.edu/~roweis/data.html

Figure 4: Transfer error vs. number of tasks T (Left) and vs. number of atoms K (Right) on the
Binary Alphadigits dataset.

With the aim of analyzing the atoms learned by the algorithm, we have carried out another
experiment where we assume that there are 10 underlying atoms (one for each digit). We compare
the resultant dictionary to that obtained by sparse coding, where all pixels are known.
The results are shown in Figure 5 and the similarity of the dictionaries confirms the theoretical
findings of Section 3.3.

Figure 5: Dictionaries found by SC-MTL using m = 240 pixels (missing 25% pixels) per image
(top) and by Sparse Coding employing all pixels (bottom).



5 Summary 

In this paper, we have explored an application of sparse coding, which has been widely used in
unsupervised learning and signal processing, to the domains of multitask learning and learning
to learn. Our learning bounds provide a justification of this method and offer insights into
its advantage over independent task learning and learning dense representations of the tasks.
The bounds, which hold in a Hilbert space setting, depend on data dependent quantities which
measure the intrinsic dimensionality of the data. Numerical simulations presented here, as
well as recent empirical results in [17], indicate that sparse coding is a promising approach to
multitask learning and can lead to significant improvements over competing methods.

In the future, it would be valuable to study extensions of our analysis to more general classes
of code vectors. For example, we could use code sets $\mathcal{C}_\alpha$ which arise from structured
sparsity norms, such as the group Lasso norm, non overlapping groups [15] or other families of
regularizers. A concrete example which comes to mind is to choose $K = Qr$, $Q, r \in \mathbb{N}$ and a
partition $\mathcal{J} = \{\{(q - 1)r + 1, \ldots, qr\} : q = 1, \ldots, Q\}$ of the index set $\{1, \ldots, K\}$ into
contiguous index sets of size $r$. Then using a norm of the type
$\|\gamma\| = \|\gamma\|_1 + \sum_{J \in \mathcal{J}} \|\gamma_J\|_2$ will encourage codes
which are sparse and use only few of the groups in $\mathcal{J}$. Using the ball associated with this norm
as our set of codes would allow us to model sets of tasks which are divided into groups.
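A small computation shows how such a norm separates codes with the same $\ell_1$-norm. The sketch below assumes the norm has the form $\|\gamma\|_1 + \sum_{J \in \mathcal{J}} \|\gamma_J\|_2$ with contiguous groups of size $r$. It uses 0-based indices, and the function names are hypothetical.

```python
import math

def contiguous_groups(Q, r):
    """Partition {0, ..., Q*r - 1} (0-based) into Q contiguous groups of size r."""
    return [list(range(q * r, (q + 1) * r)) for q in range(Q)]

def sparse_group_norm(gamma, groups):
    """l1-norm of gamma plus the sum of the Euclidean norms of its group restrictions."""
    l1 = sum(abs(g) for g in gamma)
    group_part = sum(math.sqrt(sum(gamma[j] ** 2 for j in J)) for J in groups)
    return l1 + group_part

groups = contiguous_groups(Q=2, r=2)  # [[0, 1], [2, 3]]
within_one_group = [3.0, 4.0, 0.0, 0.0]
across_two_groups = [3.0, 0.0, 4.0, 0.0]
print(sparse_group_norm(within_one_group, groups))
print(sparse_group_norm(across_two_groups, groups))
```

Both codes have $\ell_1$-norm 7, but the code that spreads its mass over two groups has the larger norm, so a ball of this norm favours codes concentrated on few groups.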
Appendix

In this appendix, we present the proofs of Theorems 1 and 2. We begin by introducing some
more notation and auxiliary results.

A Notation and tools 

Issues of measurability will be ignored throughout, in particular, if $\mathcal{F}$ is a class of real
valued functions on a domain $\mathcal{X}$ and $X$ a random variable with values in $\mathcal{X}$ then we will
always write $\mathbb{E} \sup_{f \in \mathcal{F}} f(X)$ to mean
$\sup\{\mathbb{E} \max_{f \in \mathcal{F}_0} f(X) : \mathcal{F}_0 \subseteq \mathcal{F},\ \mathcal{F}_0 \text{ finite}\}$.

In the sequel $H$ denotes a finite or infinite dimensional Hilbert space with inner product
$\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. If $T$ is a bounded linear operator on $H$ its operator norm is
written $\|T\| = \sup\{\|Tx\| : \|x\| = 1\}$.

Members of $H$ are denoted with lower case italics such as $x, v, w$, vectors composed of
such vectors are in bold lower case, i.e. $\mathbf{x} = (x_1, \ldots, x_m)$ or $\mathbf{v} = (v_1, \ldots, v_n)$, where
$m$ or $n$ are explained in the context.

An example is a pair $z = (x, y) \in H \times \mathbb{R} =: \mathcal{Z}$, a sample is a vector of such pairs
$\mathbf{z} = (z_1, \ldots, z_m) = ((x_1, y_1), \ldots, (x_m, y_m))$. Here we also write $\mathbf{z} = (\mathbf{x}, \mathbf{y})$, with
$\mathbf{x} = (x_1, \ldots, x_m) \in H^m$ and $\mathbf{y} = (y_1, \ldots, y_m) \in \mathbb{R}^m$.

A multisample is a vector $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_T)$ composed of samples. We also write
$\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$ with $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$.

For members of $\mathbb{R}^K$ we use the Greek letters $\gamma$ or $\beta$. Depending on context the inner
product and Euclidean norm on $\mathbb{R}^K$ will also be denoted with $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$. The
$\ell_1$-norm $\|\cdot\|_1$ on $\mathbb{R}^K$ is defined by $\|\beta\|_1 = \sum_{k=1}^{K} |\beta_k|$.

In the sequel we denote with $\mathcal{C}_\alpha$ the set $\{\beta \in \mathbb{R}^K : \|\beta\|_1 \le \alpha\}$, and abbreviate
$\mathcal{C}$ for the $\ell_1$-unit ball $\mathcal{C}_1$. The canonical basis of $\mathbb{R}^K$ is denoted $e_1, \ldots, e_K$.
Unless otherwise specified the summation over the index $i$ will always run from 1 to $m$, $t$ will
run from 1 to $T$, and $k$ will run from 1 to $K$.

A.1 Covariances

For x eH m the empirical covariance operator £ (x) is specified by 



£ (x) v , w ) = — (v, %i) (xi, w) , v,w G H . 



The definition implies the inequality 

(f, Xi) 2 = m (t, (x) v, < m S (x) 

It also follows that tr (t, (x) j = (1/m) £\ ||^i|| 2 - 

For a multisample $\mathbf X\in H^{mT}$ we will consider two quantities defined in terms of the empirical covariances,

$$S_1(\mathbf X) = \frac1T\sum_t\operatorname{tr}\big(\hat\Sigma(\mathbf x_t)\big)\quad\text{and}\quad S_\infty(\mathbf X) = \frac1T\sum_t\|\hat\Sigma(\mathbf x_t)\|_\infty = \frac1T\sum_t\lambda_{\max}\big(\hat\Sigma(\mathbf x_t)\big),$$

where $\lambda_{\max}$ is the largest eigenvalue. If all data points $x_{ti}$ lie in the unit ball of $H$ then $S_1(\mathbf X)\le1$. Of course $S_1(\mathbf X)$ can also be written as the trace of the total covariance $(1/T)\sum_t\hat\Sigma(\mathbf x_t)$, while $S_\infty(\mathbf X)$ will always be at least as large as the largest eigenvalue of the total covariance. We always have $S_\infty(\mathbf X)\le S_1(\mathbf X)$, with equality only if the data is one-dimensional for all tasks. The quotient $S_1(\mathbf X)/S_\infty(\mathbf X)$ can be regarded as a crude measure of the effective dimensionality of the data. If the data have a high dimensional distribution for each task then $S_\infty(\mathbf X)$ can be considerably smaller than $S_1(\mathbf X)$.
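As a rough numerical illustration of the two covariance quantities, the following sketch computes them for a finite dimensional multisample. It assumes $H = \mathbb R^d$ and stores the multisample as an array of shape `(T, m, d)`; the function name `covariance_summaries` and the synthetic data are illustrative choices, not part of the paper.

```python
import numpy as np

def covariance_summaries(X):
    """X has shape (T, m, d): T tasks, m points per task, each point in R^d (assumed layout)."""
    T, m, d = X.shape
    # Empirical covariance of task t as a d x d matrix: (1/m) * sum_i x_ti x_ti^T
    covs = np.einsum("tid,tie->tde", X, X) / m
    s1 = np.mean([np.trace(c) for c in covs])                    # average trace
    s_inf = np.mean([np.linalg.eigvalsh(c)[-1] for c in covs])   # average largest eigenvalue
    return s1, s_inf

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50, 20))
X /= np.linalg.norm(X, axis=2, keepdims=True)   # place every data point on the unit sphere
s1, s_inf = covariance_summaries(X)
print(s1, s_inf, s1 / s_inf)
```

With all points on the unit sphere the first quantity equals one, and the printed quotient gives the crude effective dimension discussed above.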



A.2 Concentration inequalities 

Let $\mathcal X$ be any space. For $\mathbf x\in\mathcal X^n$, $1\le k\le n$ and $y\in\mathcal X$ we use $\mathbf x_{k\leftarrow y}$ to denote the object obtained from $\mathbf x$ by replacing the $k$-th coordinate of $\mathbf x$ with $y$. That is

$$\mathbf x_{k\leftarrow y} = (x_1,\dots,x_{k-1},y,x_{k+1},\dots,x_n).$$

The concentration inequality in part (i) of the following theorem, known as the bounded difference inequality, is given in E4l . A proof of inequality (ii) is given in lETTl .

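Theorem 3. Let $F : \mathcal X^n\to\mathbb R$ and define $A$ and $B$ by

$$A^2 = \sup_{\mathbf x\in\mathcal X^n}\sum_{k=1}^n\sup_{y_1,y_2\in\mathcal X}\big(F(\mathbf x_{k\leftarrow y_1})-F(\mathbf x_{k\leftarrow y_2})\big)^2,$$

$$B^2 = \sup_{\mathbf x\in\mathcal X^n}\sum_{k=1}^n\Big(F(\mathbf x)-\inf_{y\in\mathcal X}F(\mathbf x_{k\leftarrow y})\Big)^2.$$

Let $\mathbf X = (X_1,\dots,X_n)$ be a vector of independent random variables with values in $\mathcal X$, and let $\mathbf X'$ be i.i.d. to $\mathbf X$. Then for any $s>0$

(i) $\Pr\{F(\mathbf X) > \mathbb EF(\mathbf X)+s\}\le e^{-2s^2/A^2}$.

(ii) $\Pr\{F(\mathbf X) > \mathbb EF(\mathbf X)+s\}\le e^{-s^2/(2B^2)}$.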
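To make part (i) concrete, the sketch below compares the bound with a simulated tail for the simplest case, where $F$ is the average of $n$ independent coordinates in $[0,1]$. The choice of $F$, the uniform distribution and the values of `n`, `trials` and `s` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, s = 50, 20000, 0.1

# F(x) = mean of n coordinates in [0, 1]. Replacing one coordinate moves F by at most 1/n,
# so A^2 = n * (1/n)^2 = 1/n in the notation of Theorem 3.
samples = rng.uniform(0.0, 1.0, size=(trials, n))
F = samples.mean(axis=1)
A_sq = 1.0 / n

empirical_tail = np.mean(F > 0.5 + s)   # E F(X) = 0.5 for uniform coordinates
bound = np.exp(-2.0 * s**2 / A_sq)      # right hand side of part (i)
print(empirical_tail, bound)
```

In this special case the bound reduces to Hoeffding's inequality, and the simulated tail should sit well below it.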



A.3 Rademacher and Gaussian averages 

We will use the term Rademacher variables for any set of independent random variables, uniformly distributed on $\{-1,1\}$, and reserve the symbol $\sigma$ for Rademacher variables. A set of random variables is called orthogaussian if the members are independent $\mathcal N(0,1)$-distributed (standard normal) variables, and we reserve the letter $\zeta$ for standard normal variables. The notation $\sigma_1,\sigma_2,\dots,\sigma_i,\dots,\sigma_{11},\dots,\sigma_{ij}$ etc. will always refer to independent Rademacher variables and $\zeta_1,\zeta_2,\dots,\zeta_i,\dots,\zeta_{11},\dots,\zeta_{ij}$ will refer to orthogaussian variables.

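For $A\subseteq\mathbb R^n$ we define the Rademacher and Gaussian averages of $A$ ([18], [6]) as

$$\mathcal R(A) = \mathbb E_\sigma\sup_{(x_1,\dots,x_n)\in A}\frac2n\sum_{i=1}^n\sigma_ix_i,$$

$$\mathcal G(A) = \mathbb E_\zeta\sup_{(x_1,\dots,x_n)\in A}\frac2n\sum_{i=1}^n\zeta_ix_i.$$

If $\mathcal F$ is a class of real valued functions on a space $\mathcal X$ and $\mathbf x = (x_1,\dots,x_n)\in\mathcal X^n$ we write

$$\mathcal F(\mathbf x) = \mathcal F(x_1,\dots,x_n) = \{(f(x_1),\dots,f(x_n)) : f\in\mathcal F\}\subseteq\mathbb R^n.$$

The empirical Rademacher and Gaussian complexities of $\mathcal F$ on $\mathbf x$ are respectively $\mathcal R(\mathcal F(\mathbf x))$ and $\mathcal G(\mathcal F(\mathbf x))$.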

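The utility of these concepts for learning theory comes from the following key result (see flUEOUl), stated here in two portions for convenience in the sequel.

Theorem 4. Let $\mathcal F$ be a real-valued function class on a space $\mathcal X$ and $\mu_1,\dots,\mu_m$ be probability measures on $\mathcal X$ with product measure $\mu = \prod_i\mu_i$ on $\mathcal X^m$. For $\mathbf x\in\mathcal X^m$ define

$$\Phi(\mathbf x) = \sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m\big(\mathbb E_{x\sim\mu_i}[f(x)]-f(x_i)\big).$$

Then $\mathbb E_{\mathbf x\sim\mu}[\Phi(\mathbf x)]\le\mathbb E_{\mathbf x\sim\mu}\mathcal R(\mathcal F(\mathbf x))$.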

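Proof. For any realization $\sigma = \sigma_1,\dots,\sigma_m$ of the Rademacher variables

$$\mathbb E_{\mathbf x\sim\mu}[\Phi(\mathbf x)] = \mathbb E_{\mathbf x\sim\mu}\sup_{f\in\mathcal F}\frac1m\,\mathbb E_{\mathbf x'\sim\mu}\sum_{i=1}^m\big(f(x_i')-f(x_i)\big) \le \mathbb E_{(\mathbf x,\mathbf x')\sim\mu\times\mu}\sup_{f\in\mathcal F}\frac1m\sum_{i=1}^m\sigma_i\big(f(x_i')-f(x_i)\big),$$

because of the symmetry of the measure $\mu\times\mu\,(\mathbf x,\mathbf x') = \prod_i\mu_i\times\prod_i\mu_i\,(\mathbf x,\mathbf x')$ under the interchange $x_i\leftrightarrow x_i'$. Taking the expectation in $\sigma$ and applying the triangle inequality gives the result. ■

Theorem 5. Let $\mathcal F$ be a $[0,1]$-valued function class on a space $\mathcal X$, and $\mu$ as above. For $\delta>0$ we have with probability greater than $1-\delta$ in the sample $\mathbf x\sim\mu$ that for all $f\in\mathcal F$

$$\frac1m\sum_{i=1}^m\mathbb E_{x\sim\mu_i}[f(x)]\le\frac1m\sum_{i=1}^mf(x_i)+\mathbb E_{\mathbf x\sim\mu}\mathcal R(\mathcal F(\mathbf x))+\sqrt{\frac{\ln(1/\delta)}{2m}}.$$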

To prove this apply the bounded difference inequality (part (i) of Theorem 3) to the function $\Phi$ of the previous theorem (see e.g. [6]). Under the conditions of this result, changing one of the $x_i$ will not change $\mathcal R(\mathcal F(\mathbf x))$ by more than $2/m$, so again by the bounded difference inequality applied to $\mathcal R(\mathcal F(\mathbf x))$ and a union bound we obtain the data dependent version.

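Corollary 6. Let $\mathcal F$ and $\mu$ be as above. For $\delta>0$ we have with probability greater than $1-\delta$ in the sample $\mathbf x\sim\mu$ that for all $f\in\mathcal F$

$$\frac1m\sum_{i=1}^m\mathbb E_{x\sim\mu_i}[f(x)]\le\frac1m\sum_{i=1}^mf(x_i)+\mathcal R(\mathcal F(\mathbf x))+\sqrt{\frac{9\ln(2/\delta)}{2m}}.$$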

To bound Rademacher averages the following result is very useful ll6l[Tl[T8l

Lemma 7. Let $A\subseteq\mathbb R^n$, and let $\psi_1,\dots,\psi_n$ be real functions such that $\psi_i(s)-\psi_i(t)\le L|s-t|$, $\forall i$ and $s,t\in\mathbb R$. Define $\psi(A) = \{(\psi_1(x_1),\dots,\psi_n(x_n)) : (x_1,\dots,x_n)\in A\}$. Then

$$\mathcal R(\psi(A))\le L\,\mathcal R(A).$$

Sometimes it is more convenient to work with Gaussian averages, which can be used instead by virtue of the next lemma. For a proof see, for example, [18, p. 97].

Lemma 8. For $A\subseteq\mathbb R^k$ we have $\mathcal R(A)\le\sqrt{\pi/2}\;\mathcal G(A)$.
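The comparison between the two averages can be checked numerically on a small finite set. The sketch below is only an illustration under assumed choices: the set $A = \{x,-x\}$ with $x = (1,\dots,1)\in\mathbb R^8$, exact enumeration of all sign vectors for the Rademacher average, and a Monte Carlo estimate (hypothetical helper `gaussian_average`) for the Gaussian average.

```python
import itertools
import numpy as np

def rademacher_average(A):
    """Exact value of E_sigma sup_{x in A} (2/n) sum_i sigma_i x_i for a finite set A (rows)."""
    n = A.shape[1]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # all 2^n sign vectors
    return (2.0 / n) * np.mean(np.max(signs @ A.T, axis=1))

def gaussian_average(A, draws=200000, seed=0):
    """Monte Carlo estimate of E_zeta sup_{x in A} (2/n) sum_i zeta_i x_i."""
    n = A.shape[1]
    zeta = np.random.default_rng(seed).standard_normal((draws, n))
    return (2.0 / n) * np.mean(np.max(zeta @ A.T, axis=1))

ones = np.ones(8)
A = np.vstack([ones, -ones])   # A = {x, -x} with x = (1, ..., 1) in R^8
R = rademacher_average(A)
G = gaussian_average(A)
print(R, np.sqrt(np.pi / 2) * G)
```

For this set the supremum is an absolute value, so the Gaussian side has the closed form $(2/n)\|x\|\sqrt{2/\pi}$, which the estimate should approximate.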

The next result is known as Slepian's lemma ([28], [18]).

Theorem 9. Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed by a common set $S$, such that

$$\mathbb E(\Omega_{s_1}-\Omega_{s_2})^2\le\mathbb E(\Xi_{s_1}-\Xi_{s_2})^2\quad\text{for all } s_1,s_2\in S.$$

Then

$$\mathbb E\sup_{s\in S}\Omega_s\le\mathbb E\sup_{s\in S}\Xi_s.$$



B Proofs



B.1 Multitask learning

In this section we prove Theorem 1. It is an immediate consequence of Hoeffding's inequality and the following uniform bound on the estimation error.

Theorem 10. Let $\delta>0$, fix $K$ and let $\mu_1,\dots,\mu_T$ be probability measures on $H\times\mathbb R$. With probability at least $1-\delta$ in the draw of $\mathbf Z\sim\prod_{t=1}^T\mu_t^m$ we have for all $D\in\mathcal D_K$ and all $\gamma\in(C_\alpha)^T$ that

$$\frac1T\sum_{t=1}^T\mathbb E_{(x,y)\sim\mu_t}\big[\ell(\langle D\gamma_t,x\rangle,y)\big]-\frac1{mT}\sum_{t=1}^T\sum_{i=1}^m\ell(\langle D\gamma_t,x_{ti}\rangle,y_{ti}) \le L\alpha\sqrt{\frac{2S_1(\mathbf X)(K+12)}{mT}}+L\alpha\sqrt{\frac{8S_\infty(\mathbf X)\ln(2K)}{m}}+\sqrt{\frac{9\ln2/\delta}{2mT}}.$$

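The proof of this theorem requires auxiliary results. Fix $\mathbf X\in H^{mT}$ and for $\gamma = (\gamma_1,\dots,\gamma_T)\in(\mathbb R^K)^T$ define the random variable

$$F_\gamma = F_\gamma(\sigma) = \sup_{D\in\mathcal D_K}\sum_{t,i}\sigma_{ti}\langle D\gamma_t,x_{ti}\rangle.$$

Lemma 11. (i) If $\gamma = (\gamma_1,\dots,\gamma_T)$ satisfies $\|\gamma_t\|\le1$ for all $t$, then

$$\mathbb EF_\gamma\le\sqrt{mTK\,S_1(\mathbf X)}.$$

(ii) If $\gamma$ satisfies $\|\gamma_t\|_1\le1$ for all $t$, then for any $s\ge0$

$$\Pr\{F_\gamma>\mathbb E[F_\gamma]+s\}\le\exp\left(\frac{-s^2}{8mT\,S_\infty(\mathbf X)}\right).$$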

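Proof. (i) We observe that

$$\mathbb EF_\gamma = \mathbb E\sup_{D\in\mathcal D_K}\sum_k\Big\langle De_k,\sum_{t,i}\sigma_{ti}\gamma_{tk}x_{ti}\Big\rangle \le \sup_{D\in\mathcal D_K}\Big(\sum_k\|De_k\|^2\Big)^{1/2}\mathbb E\Big(\sum_k\Big\|\sum_{t,i}\sigma_{ti}\gamma_{tk}x_{ti}\Big\|^2\Big)^{1/2} \le \sqrt K\Big(\sum_{t,i}\|\gamma_t\|^2\|x_{ti}\|^2\Big)^{1/2}\le\sqrt{mTK\,S_1(\mathbf X)},$$

where the last two steps use Jensen's inequality, the orthogonality of the $\sigma_{ti}$ and $\|\gamma_t\|\le1$.

(ii) For any configuration $\sigma$ of the Rademacher variables let $D(\sigma)$ be the maximizer in the definition of $F_\gamma(\sigma)$. Then for any $s\in\{1,\dots,T\}$, $j\in\{1,\dots,m\}$ and any $\sigma'\in\{-1,1\}$ to replace $\sigma_{sj}$ we have

$$F_\gamma(\sigma)-F_\gamma(\sigma_{(sj)\leftarrow\sigma'})\le2\,|\langle D(\sigma)\gamma_s,x_{sj}\rangle|.$$

Using the inequality (A.1) we then obtain

$$\begin{aligned}
\sum_{s,j}\Big(F_\gamma(\sigma)-\inf_{\sigma'}F_\gamma(\sigma_{(sj)\leftarrow\sigma'})\Big)^2 &\le 4\sum_{t,i}\langle D(\sigma)\gamma_t,x_{ti}\rangle^2\\
&\le 4m\sum_t\|\hat\Sigma(\mathbf x_t)\|_\infty\|D(\sigma)\gamma_t\|^2\\
&\le 4m\sum_t\|\hat\Sigma(\mathbf x_t)\|_\infty = 4mT\,S_\infty(\mathbf X).
\end{aligned}$$

In the last inequality we used the fact that for any $D\in\mathcal D_K$ we have $\|D\gamma_t\|\le\sum_k|\gamma_{tk}|\,\|De_k\|\le\|\gamma_t\|_1\le1$. The conclusion now follows from part (ii) of Theorem 3. ■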

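Proposition 12. For every fixed $\mathbf Z = (\mathbf X,\mathbf Y)\in(H\times\mathbb R)^{mT}$ we have

$$\mathbb E_\sigma\sup_{D\in\mathcal D_K,\,\gamma\in(C_\alpha)^T}\sum_{t,i}\sigma_{ti}\,\ell(\langle D\gamma_t,x_{ti}\rangle,y_{ti}) \le L\alpha\sqrt{2mT\,S_1(\mathbf X)(K+12)}+L\alpha T\sqrt{8m\,S_\infty(\mathbf X)\ln(2K)}.$$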

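Proof. It suffices to prove the result for $\alpha = 1$, the general result being a consequence of rescaling. By Lemma 7 and the Lipschitz properties of the loss function $\ell$ we have

$$\mathbb E_\sigma\sup_{D\in\mathcal D_K,\,\gamma\in(C)^T}\sum_{t,i}\sigma_{ti}\,\ell(\langle D\gamma_t,x_{ti}\rangle,y_{ti}) \le L\,\mathbb E_\sigma\sup_{D\in\mathcal D_K,\,\gamma\in(C)^T}\sum_{t,i}\sigma_{ti}\langle D\gamma_t,x_{ti}\rangle.$$

Since linear functions on a compact convex set attain their maxima at the extreme points, we have

$$\mathbb E\sup_{D\in\mathcal D_K,\,\gamma\in(C)^T}\sum_{t=1}^T\sum_{i=1}^m\sigma_{ti}\langle D\gamma_t,x_{ti}\rangle = \mathbb E\max_{\gamma\in\mathrm{ext}(C)^T}F_\gamma,$$

where $F_\gamma$ is the random variable defined before Lemma 11. Now for any $\delta\ge0$ we have, since $F_\gamma\ge0$,

$$\begin{aligned}
\mathbb E\max_{\gamma\in\mathrm{ext}(C)^T}F_\gamma &= \int_0^\infty\Pr\Big\{\max_{\gamma\in\mathrm{ext}(C)^T}F_\gamma>s\Big\}\,ds\\
&\le\sqrt{mKT\,S_1(\mathbf X)}+\delta+\sum_{\gamma\in(\mathrm{ext}(C))^T}\int_{\sqrt{mKT\,S_1(\mathbf X)}+\delta}^\infty\Pr\{F_\gamma>s\}\,ds\\
&\le\sqrt{mKT\,S_1(\mathbf X)}+\delta+\sum_{\gamma\in(\mathrm{ext}(C))^T}\int_\delta^\infty\Pr\{F_\gamma>\mathbb EF_\gamma+s\}\,ds\\
&\le\sqrt{mKT\,S_1(\mathbf X)}+\delta+(2K)^T\int_\delta^\infty\exp\left(\frac{-s^2}{8mT\,S_\infty(\mathbf X)}\right)ds\\
&\le\sqrt{mKT\,S_1(\mathbf X)}+\delta+\frac{4mT\,S_\infty(\mathbf X)\,(2K)^T}{\delta}\exp\left(\frac{-\delta^2}{8mT\,S_\infty(\mathbf X)}\right).
\end{aligned}$$

Here the first inequality follows from the fact that probabilities never exceed 1 and a union bound. The second inequality follows from Lemma 11 part (i), since $\mathbb EF_\gamma\le\sqrt{mKT\,S_1(\mathbf X)}$. The third inequality follows from Lemma 11 part (ii), and the fact that the cardinality of $\mathrm{ext}(C)$ is $2K$, and the last inequality follows from a well known estimate on Gaussian random variables.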
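For completeness, the Gaussian tail estimate invoked in the last step can be derived in one line. The symbol $\sigma^2$ below is a generic variance parameter introduced only for this sketch (it is not a Rademacher variable); the bound is then applied with $2\sigma^2 = 8mT\,S_\infty(\mathbf X)$.

```latex
% For \delta > 0 we have s/\delta \ge 1 on [\delta, \infty), hence
\int_\delta^\infty e^{-s^2/(2\sigma^2)}\,ds
  \;\le\; \int_\delta^\infty \frac{s}{\delta}\, e^{-s^2/(2\sigma^2)}\,ds
  \;=\; \frac{\sigma^2}{\delta}\, e^{-\delta^2/(2\sigma^2)} .
% With \sigma^2 = 4mT\,S_\infty(\mathbf X) the prefactor becomes 4mT\,S_\infty(\mathbf X)/\delta.
```

The same estimate with $\sigma^2 = m\|\hat\Sigma(\mathbf x)\|_\infty$ is the one needed in the single task argument of Section B.2.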

Setting $\delta = \sqrt{8mT\,S_\infty(\mathbf X)\ln\big((2K)^T\big)}$ we obtain with some easy simplifying estimates

$$\mathbb E\max_{\gamma\in\mathrm{ext}(C)^T}F_\gamma\le\sqrt{2mT(K+12)\,S_1(\mathbf X)}+T\sqrt{8m\,S_\infty(\mathbf X)\ln(2K)},$$

which together with the two relations displayed at the start of the proof gives the result. ■

Theorem 10 now follows from Corollary 6.

If the set $C_\alpha$ is replaced by any other subset $C$ of the $\ell_2$-ball of radius $\alpha$, a similar proof strategy can be employed. The denominator in the exponent of Lemma 11 (ii) then obtains another factor of $\sqrt K$. The union bound over the extreme points in $\mathrm{ext}(C)$ in the previous proposition can be replaced by a union bound over a cover of $C$. This leads to the alternative result mentioned in Remark 5 following the statement of Theorem 1.

Another modification leads to a bound for the method presented in [17], where the constraint $\|De_k\|\le1$ is replaced by $\|D\|_2\le\sqrt K$ (here $\|\cdot\|_2$ is the Frobenius or Hilbert-Schmidt norm) and the constraint $\|\gamma_t\|_1\le\alpha$, $\forall t$ is replaced by $\sum_t\|\gamma_t\|_1\le\alpha T$. To explain the modification we set $\alpha = 1$. Part (i) of Lemma 11 is easily verified. The union bound over $(\mathrm{ext}(C))^T$ in the previous proposition is replaced by a union bound over the $2TK$ extreme points of the $\ell_1$-ball of radius $T$ in $\mathbb R^{TK}$. For part (ii) we use the fact that the concentration result is only needed for $\gamma$ being an extreme point (so that it involves only a single task) and obtain the bound $\sum_t\|\hat\Sigma(\mathbf x_t)\|_\infty\|D\gamma_t\|^2\le T^2K\,S'_\infty(\mathbf X)$, leading to

$$\Pr\{F_\gamma>\mathbb E[F_\gamma]+s\}\le\exp\left(\frac{-s^2}{8mT^2K\,S'_\infty(\mathbf X)}\right).$$

Proceeding as above we obtain the excess risk bound

$$L\alpha\sqrt{\frac{2S_1(\mathbf X)(K+12)}{mT}}+L\alpha\sqrt{\frac{8K\,S'_\infty(\mathbf X)\ln(2KT)}{m}}+\sqrt{\frac{8\ln4/\delta}{mT}}$$

to replace the bound in Theorem 1. The factor $\sqrt K$ in the second term seems quite weak, but it must be borne in mind that the constraint $\|D\|_2\le\sqrt K$ is much weaker than $\|De_k\|\le1$, and allows for a smaller approximation error. If we retain $\|De_k\|\le1$ and only modify the $\gamma$-constraint to $\sum_t\|\gamma_t\|_1\le\alpha T$, the $\sqrt K$ in the second term disappears and by comparison to Theorem 1 there is only an additional $\ln T$ and the switch from $S_\infty(\mathbf X)$ to $S'_\infty(\mathbf X)$, reflecting the fact that $\sum_t\|\gamma_t\|_1\le\alpha T$ is a much weaker constraint than $\|\gamma_t\|_1\le\alpha$, $\forall t$, so that, again, a smaller minimum in (2.1) is possible for the modified method.



B.2 Learning to learn 

In this section we prove Theorem 2. The basic strategy is as follows. Recall the definition (3.2) of the measure $\rho_{\mathcal E}$, which governs the generation of a training sample in the environment $\mathcal E$. On a given training sample $\mathbf z$ the algorithm $A_D$ as defined in (3.2) incurs the empirical risk

$$\hat R_D(\mathbf z) = \min_{\gamma\in C_\alpha}\frac1m\sum_{i=1}^m\ell(\langle D\gamma,x_i\rangle,y_i).$$

The algorithm $A_D$, essentially being the Lasso, has very good estimation properties, so $\hat R_D(\mathbf z)$ will be close to the true risk of $A_D$ in the corresponding task. This means that we only really need to estimate the expected empirical risk $\mathbb E_{\mathbf z\sim\rho_{\mathcal E}}\hat R_D(\mathbf z)$ of $A_D$ on future tasks. On the other hand the minimization problem (2.1) can be written as

$$\min_{D\in\mathcal D_K}\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\quad\text{with}\quad\mathbf Z = (\mathbf z_1,\dots,\mathbf z_T)\sim(\rho_{\mathcal E})^T,$$

with dictionary $D(\mathbf Z)$ being the minimizer. If $\mathcal D_K$ is not too large this should be similar to $\mathbb E_{\mathbf z\sim\rho_{\mathcal E}}\hat R_{D(\mathbf Z)}(\mathbf z)$. In the sequel we make this precise.
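The empirical risk $\hat R_D(\mathbf z)$ is the value of an $\ell_1$-constrained minimization, which can be approximated numerically. The sketch below is one possible way to do so and is not the procedure used in the paper: it assumes $H = \mathbb R^d$, the squared loss, a dictionary stored as a $d\times K$ matrix whose columns are the atoms $De_k$, and projected gradient descent with a sort-based projection onto the $\ell_1$-ball. The names `project_l1_ball` and `empirical_risk` are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {g : ||g||_1 <= radius} (sort-based method)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]   # last index with a positive shrunken entry
    theta = (css[rho] - radius) / (rho + 1.0)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def empirical_risk(D, x, y, alpha, steps=500):
    """Approximates min over ||gamma||_1 <= alpha of (1/m) sum_i (<D gamma, x_i> - y_i)^2."""
    m = len(y)
    features = x @ D                                       # row i holds (<D e_k, x_i>)_k
    step = m / (2.0 * np.linalg.norm(features, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    gamma = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = (2.0 / m) * features.T @ (features @ gamma - y)
        gamma = project_l1_ball(gamma - step * grad, alpha)
    return np.mean((features @ gamma - y) ** 2), gamma

rng = np.random.default_rng(0)
d, K, m, alpha = 30, 10, 200, 1.0
D = rng.normal(size=(d, K))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms, so ||D e_k|| <= 1
x = rng.normal(size=(m, d)) / np.sqrt(d)
y = 0.5 * (x @ D[:, 0])                     # task generated by the sparse code 0.5 * e_1
risk, gamma = empirical_risk(D, x, y, alpha)
print(risk, np.abs(gamma).sum())
```

Averaging `risk` over the $T$ training samples gives the objective that $D(\mathbf Z)$ minimizes over the dictionary; the sketch leaves that outer minimization aside.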

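Lemma 13. For $v\in H$ with $\|v\|\le1$ and $\mathbf x\in H^m$ let $F$ be the random variable

$$F = \Big|\Big\langle v,\sum_i\sigma_ix_i\Big\rangle\Big|.$$

Then (i) $\mathbb EF\le\sqrt m\,\|\hat\Sigma(\mathbf x)\|_\infty^{1/2}$ and (ii) for $s\ge0$

$$\Pr\{F>\mathbb EF+s\}\le\exp\left(\frac{-s^2}{2m\,\|\hat\Sigma(\mathbf x)\|_\infty}\right).$$

Proof. (i) Using Jensen's inequality and (A.1) we get

$$\mathbb EF\le\Big(\mathbb E\Big\langle v,\sum_i\sigma_ix_i\Big\rangle^2\Big)^{1/2} = \Big(\sum_i\langle v,x_i\rangle^2\Big)^{1/2}\le\sqrt m\,\|\hat\Sigma(\mathbf x)\|_\infty^{1/2}.$$

(ii) Let $\sigma$ be any configuration of the Rademacher variables. For any $\sigma',\sigma''\in\{-1,1\}$ to replace $\sigma_j$ we have

$$F(\sigma_{j\leftarrow\sigma'})-F(\sigma_{j\leftarrow\sigma''})\le2\,|\langle v,x_j\rangle|,$$

so the conclusion follows from (A.1) and the bounded difference inequality, Theorem 3 (i). ■
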
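Lemma 14. For $v_1,\dots,v_K\in H$ satisfying $\|v_k\|\le1$, $\mathbf x\in H^m$ we have

$$\mathbb E\max_k\Big|\Big\langle v_k,\sum_i\sigma_ix_i\Big\rangle\Big|\le\sqrt{2m\,\|\hat\Sigma(\mathbf x)\|_\infty}\,\big(2+\sqrt{\ln K}\big).$$

Proof. Let $F_k = |\langle v_k,\sum_i\sigma_ix_i\rangle|$. Using integration by parts we have for $\delta\ge0$

$$\begin{aligned}
\mathbb E\max_kF_k &\le\sqrt{m\,\|\hat\Sigma(\mathbf x)\|_\infty}+\delta+\int_{\sqrt{m\|\hat\Sigma(\mathbf x)\|_\infty}+\delta}^\infty\Pr\Big\{\max_kF_k>s\Big\}\,ds\\
&\le\sqrt{m\,\|\hat\Sigma(\mathbf x)\|_\infty}+\delta+\sum_k\int_\delta^\infty\Pr\{F_k>\mathbb EF_k+s\}\,ds\\
&\le\sqrt{m\,\|\hat\Sigma(\mathbf x)\|_\infty}+\delta+\sum_k\int_\delta^\infty\exp\left(\frac{-s^2}{2m\,\|\hat\Sigma(\mathbf x)\|_\infty}\right)ds\\
&\le\sqrt{m\,\|\hat\Sigma(\mathbf x)\|_\infty}+\delta+\frac{mK\,\|\hat\Sigma(\mathbf x)\|_\infty}{\delta}\exp\left(\frac{-\delta^2}{2m\,\|\hat\Sigma(\mathbf x)\|_\infty}\right).
\end{aligned}$$

Above the first inequality is trivial, the second follows from Lemma 13 (i) and a union bound, the third inequality follows from Lemma 13 (ii) and the last from a well known approximation. The conclusion follows from substitution of $\delta = \sqrt{2m\,\|\hat\Sigma(\mathbf x)\|_\infty\ln(eK)}$.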
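A quick numerical sanity check of Lemma 14 is possible in small dimension, where the expectation over all sign configurations can be computed exactly. The sketch assumes $H = \mathbb R^d$ and uses randomly drawn unit vectors $v_k$ and data $x_i$; all sizes are arbitrary illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, d, K = 10, 6, 50
x = rng.normal(size=(m, d))
v = rng.normal(size=(K, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)          # ||v_k|| = 1 for every k

signs = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))   # all 2^m configurations
sums = signs @ x                                        # each row is sum_i sigma_i x_i
lhs = np.mean(np.max(np.abs(sums @ v.T), axis=1))       # exact E max_k |<v_k, sum_i sigma_i x_i>|

lam_max = np.linalg.eigvalsh(x.T @ x / m)[-1]           # operator norm of the empirical covariance
rhs = np.sqrt(2 * m * lam_max) * (2 + np.sqrt(np.log(K)))
print(lhs, rhs)
```

The left hand side grows only through the $\sqrt{\ln K}$ term as more vectors $v_k$ are added, which is the feature of the lemma exploited in the proof of Proposition 15.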



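Proposition 15. Let $S_\infty(\mathcal E) := \mathbb E_{\tau\sim\mathcal E}\mathbb E_{(\mathbf x,\mathbf y)\sim\mu_\tau^m}\|\hat\Sigma(\mathbf x)\|_\infty$. With probability at least $1-\delta$ in the multisample $\mathbf Z\sim\rho_{\mathcal E}^T$

$$\sup_{D\in\mathcal D_K}\Big(R_{\mathcal E}(A_D)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big) \le L\alpha K\sqrt{\frac{2\pi S_1(\mathbf X)}{T}}+4L\alpha\sqrt{\frac{S_\infty(\mathcal E)(2+\ln K)}{m}}+\sqrt{\frac{9\ln2/\delta}{2T}}.$$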

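Proof. Following our strategy we write (abbreviating $\rho = \rho_{\mathcal E}$)

$$\begin{aligned}
\sup_{D\in\mathcal D_K}\Big(R_{\mathcal E}(A_D)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big) &\le\sup_{D\in\mathcal D_K}\mathbb E_{\tau\sim\mathcal E}\mathbb E_{\mathbf z\sim\mu_\tau^m}\Big[\mathbb E_{(x,y)\sim\mu_\tau}\big[\ell(\langle A_D(\mathbf z),x\rangle,y)\big]-\hat R_D(\mathbf z)\Big]\\
&\quad+\sup_{D\in\mathcal D_K}\Big(\mathbb E_{\mathbf z\sim\rho}\hat R_D(\mathbf z)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big)
\end{aligned}$$

and proceed by bounding each of the two terms in turn.

For any fixed dictionary $D$ and any measure $\mu$ on $Z$ we have

$$\begin{aligned}
\mathbb E_{\mathbf z\sim\mu^m}\Big[\mathbb E_{(x,y)\sim\mu}\big[\ell(\langle A_D(\mathbf z),x\rangle,y)\big]-\hat R_D(\mathbf z)\Big] &\le\mathbb E_{\mathbf z\sim\mu^m}\sup_{\gamma\in C_\alpha}\Big(\mathbb E_{(x,y)\sim\mu}\big[\ell(\langle D\gamma,x\rangle,y)\big]-\frac1m\sum_{i=1}^m\ell(\langle D\gamma,x_i\rangle,y_i)\Big)\\
&\le\frac2m\,\mathbb E_{\mathbf z\sim\mu^m}\mathbb E_\sigma\sup_{\gamma\in C_\alpha}\sum_{i=1}^m\sigma_i\,\ell(\langle D\gamma,x_i\rangle,y_i) &&\text{by Theorem 4}\\
&\le\frac{2L}m\,\mathbb E_{\mathbf z\sim\mu^m}\mathbb E_\sigma\sup_{\gamma\in C_\alpha}\sum_k\gamma_k\Big\langle De_k,\sum_{i=1}^m\sigma_ix_i\Big\rangle &&\text{by Lemma 7}\\
&\le\frac{2L\alpha}m\,\mathbb E_{\mathbf z\sim\mu^m}\mathbb E_\sigma\max_k\Big|\Big\langle De_k,\sum_{i=1}^m\sigma_ix_i\Big\rangle\Big| &&\text{by Hölder's inequality}\\
&\le\frac{2L\alpha}m\,\mathbb E_{\mathbf z\sim\mu^m}\sqrt{2m\,\lambda_{\max}\big(\hat\Sigma(\mathbf x)\big)}\,\big(2+\sqrt{\ln K}\big) &&\text{by Lemma 14}\\
&\le2L\alpha\sqrt{\frac{4\,\mathbb E_{\mathbf z\sim\mu^m}\lambda_{\max}\big(\hat\Sigma(\mathbf x)\big)(2+\ln K)}{m}} &&\text{by Jensen's inequality.}
\end{aligned}$$

This gives the bound

$$\mathbb E_{\mathbf z\sim\mu^m}\Big[\mathbb E_{(x,y)\sim\mu}\big[\ell(\langle A_D(\mathbf z),x\rangle,y)\big]-\hat R_D(\mathbf z)\Big]\le4L\alpha\sqrt{\frac{\mathbb E_{\mathbf z\sim\mu^m}\lambda_{\max}\big(\hat\Sigma(\mathbf x)\big)(2+\ln K)}{m}},$$

valid for every measure $\mu$ on $H\times\mathbb R$ and every $D\in\mathcal D_K$. Replacing $\mu$ by $\mu_\tau$, taking the expectation as $\tau\sim\mathcal E$ and using Jensen's inequality bounds the first term on the right hand side of the decomposition above by the second term on the right hand side of the inequality claimed in the proposition.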

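We proceed to bound the second term. From Corollary 6 and Lemma 8 we get that with probability at least $1-\delta$ in $\mathbf Z\sim(\rho_{\mathcal E})^T$

$$\sup_{D\in\mathcal D_K}\Big(\mathbb E_{\mathbf z\sim\rho}\hat R_D(\mathbf z)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big)\le\frac{\sqrt{2\pi}}T\,\mathbb E_\zeta\sup_{D\in\mathcal D_K}\sum_{t=1}^T\zeta_t\hat R_D(\mathbf z_t)+\sqrt{\frac{9\ln2/\delta}{2T}},$$

where $\zeta_t$ is an orthogaussian sequence. Define two Gaussian processes $\Omega$ and $\Xi$ indexed by $\mathcal D_K$ as

$$\Omega_D = \sum_{t=1}^T\zeta_t\hat R_D(\mathbf z_t)\quad\text{and}\quad\Xi_D = \frac{L\alpha}{\sqrt m}\sum_{t=1}^T\sum_{i=1}^m\sum_{k=1}^K\zeta_{tki}\langle De_k,x_{ti}\rangle,$$

where the $\zeta_{tki}$ are also orthogaussian. Then for $D_1,D_2\in\mathcal D_K$

$$\begin{aligned}
\mathbb E(\Omega_{D_1}-\Omega_{D_2})^2 &= \sum_{t=1}^T\big(\hat R_{D_1}(\mathbf z_t)-\hat R_{D_2}(\mathbf z_t)\big)^2\\
&\le\sum_{t=1}^T\Big(\sup_{\gamma\in C_\alpha}\frac1m\sum_{i=1}^m\big|\ell(\langle D_1\gamma,x_{ti}\rangle,y_{ti})-\ell(\langle D_2\gamma,x_{ti}\rangle,y_{ti})\big|\Big)^2\\
&\le L^2\sum_{t=1}^T\sup_{\gamma\in C_\alpha}\Big(\frac1m\sum_{i=1}^m\big|\langle\gamma,(D_1^*-D_2^*)x_{ti}\rangle\big|\Big)^2 &&\text{Lipschitz}\\
&\le\frac{L^2}m\sum_{t=1}^T\sup_{\gamma\in C_\alpha}\sum_{i=1}^m\langle\gamma,(D_1^*-D_2^*)x_{ti}\rangle^2 &&\text{Jensen}\\
&\le\frac{L^2\alpha^2}m\sum_{t=1}^T\sum_{i=1}^m\|(D_1^*-D_2^*)x_{ti}\|^2 &&\text{Cauchy-Schwarz}\\
&= \frac{L^2\alpha^2}m\sum_{t=1}^T\sum_{i=1}^m\sum_{k=1}^K\big(\langle D_1e_k,x_{ti}\rangle-\langle D_2e_k,x_{ti}\rangle\big)^2 = \mathbb E(\Xi_{D_1}-\Xi_{D_2})^2.
\end{aligned}$$

So by Slepian's Lemma

$$\begin{aligned}
\mathbb E\sup_{D\in\mathcal D_K}\sum_{t=1}^T\zeta_t\hat R_D(\mathbf z_t) &= \mathbb E\sup_{D\in\mathcal D_K}\Omega_D\le\mathbb E\sup_{D\in\mathcal D_K}\Xi_D\\
&= \frac{L\alpha}{\sqrt m}\,\mathbb E\sup_{D\in\mathcal D_K}\sum_{t=1}^T\sum_{i=1}^m\sum_{k=1}^K\zeta_{tki}\langle De_k,x_{ti}\rangle\\
&\le\frac{L\alpha}{\sqrt m}\sup_{D\in\mathcal D_K}\Big(\sum_k\|De_k\|^2\Big)^{1/2}\mathbb E\Big(\sum_k\Big\|\sum_{t,i}\zeta_{tki}x_{ti}\Big\|^2\Big)^{1/2}\\
&\le\frac{L\alpha\sqrt K}{\sqrt m}\Big(\sum_k\mathbb E\Big\|\sum_{t,i}\zeta_{tki}x_{ti}\Big\|^2\Big)^{1/2}\\
&= \frac{L\alpha\sqrt K}{\sqrt m}\Big(K\sum_{t,i}\|x_{ti}\|^2\Big)^{1/2}\\
&= L\alpha K\sqrt{T\,S_1(\mathbf X)}.
\end{aligned}$$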



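We therefore have that with probability at least $1-\delta$ in the draw of the multisample $\mathbf Z\sim\rho^T$

$$\sup_{D\in\mathcal D_K}\Big(\mathbb E_{\mathbf z\sim\rho}\hat R_D(\mathbf z)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big)\le L\alpha K\sqrt{\frac{2\pi S_1(\mathbf X)}{T}}+\sqrt{\frac{9\ln2/\delta}{2T}},$$

which combines with the bound on the first term of the decomposition to give the conclusion.

Proof of Theorem 2. Let $D_{opt}$ and $\gamma_\tau$ be the minimizers in the definition of $R_{opt}$, so that

$$R_{opt} = \mathbb E_{\tau\sim\mathcal E}\mathbb E_{(x,y)\sim\mu_\tau}\big[\ell(\langle D_{opt}\gamma_\tau,x\rangle,y)\big].$$

$R_{\mathcal E}(A_{D(\mathbf Z)})-R_{opt}$ can be decomposed as the sum of four terms,

\begin{align}
&\Big(R_{\mathcal E}(A_{D(\mathbf Z)})-\frac1T\sum_{t=1}^T\hat R_{D(\mathbf Z)}(\mathbf z_t)\Big) \tag{B.3}\\
&+\Big(\frac1T\sum_{t=1}^T\hat R_{D(\mathbf Z)}(\mathbf z_t)-\frac1T\sum_{t=1}^T\hat R_{D_{opt}}(\mathbf z_t)\Big) \tag{B.4}\\
&+\Big(\frac1T\sum_{t=1}^T\hat R_{D_{opt}}(\mathbf z_t)-\mathbb E_{\mathbf z\sim\rho_{\mathcal E}}\hat R_{D_{opt}}(\mathbf z)\Big) \tag{B.5}\\
&+\mathbb E_{\tau\sim\mathcal E}\Big[\mathbb E_{\mathbf z\sim\mu_\tau^m}\hat R_{D_{opt}}(\mathbf z)-\mathbb E_{(x,y)\sim\mu_\tau}\big[\ell(\langle D_{opt}\gamma_\tau,x\rangle,y)\big]\Big]. \tag{B.6}
\end{align}

By definition of $\hat R$ we have for every $\tau$ that

$$\begin{aligned}
\mathbb E_{\mathbf z\sim\mu_\tau^m}\hat R_{D_{opt}}(\mathbf z) &= \mathbb E_{\mathbf z\sim\mu_\tau^m}\min_{\gamma\in C_\alpha}\frac1m\sum_{i=1}^m\ell(\langle D_{opt}\gamma,x_i\rangle,y_i)\\
&\le\mathbb E_{\mathbf z\sim\mu_\tau^m}\frac1m\sum_{i=1}^m\ell(\langle D_{opt}\gamma_\tau,x_i\rangle,y_i) = \mathbb E_{(x,y)\sim\mu_\tau}\ell(\langle D_{opt}\gamma_\tau,x\rangle,y).
\end{aligned}$$

The term (B.6) above is therefore non-positive. By Hoeffding's inequality the term (B.5) is less than $\sqrt{\ln(2/\delta)/2T}$ with probability at least $1-\delta/2$. The term (B.4) is non-positive by the definition of $D(\mathbf Z)$. Finally we use Proposition 15 to obtain with probability at least $1-\delta/2$ that

$$\begin{aligned}
R_{\mathcal E}(A_{D(\mathbf Z)})-\frac1T\sum_{t=1}^T\hat R_{D(\mathbf Z)}(\mathbf z_t) &\le\sup_{D\in\mathcal D_K}\Big(R_{\mathcal E}(A_D)-\frac1T\sum_{t=1}^T\hat R_D(\mathbf z_t)\Big)\\
&\le L\alpha K\sqrt{\frac{2\pi S_1(\mathbf X)}{T}}+4L\alpha\sqrt{\frac{S_\infty(\mathcal E)(2+\ln K)}{m}}+\sqrt{\frac{9\ln4/\delta}{2T}}.
\end{aligned}$$

Combining these estimates on (B.3), (B.4), (B.5) and (B.6) in a union bound gives the conclusion. ■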